Machine learning approaches to predict the need for intensive care unit admission among Iranian COVID‐19 patients based on ICD‐10: A cross‐sectional study

Abstract
Background & Aim: Timely identification of the patients requiring intensive care unit (ICU) admission could be life‐saving. We aimed to compare different machine learning algorithms to predict the requirements for ICU admission in COVID‐19 patients.
Methods: We screened all patients with COVID‐19 at six academic hospitals in Tehran comprising our study population. A total of 44,112 COVID‐19 patients (≥18 years old) were included, among which 7722 patients were hospitalized. We used a Random Forest algorithm to select significant variables. Then, prediction models were developed using the Support Vector Machine, Naïve Bayes, logistic regression, LightGBM, decision tree, and K‐Nearest Neighbor algorithms. Sensitivity, specificity, accuracy, F1 score, and receiver operating characteristic‐Area Under the Curve (AUC) were used to compare the prediction performance of different models.
Results: Based on Random Forest, the following predictors were selected: age, cardiac disease, cough, hypertension, diabetes, influenza & pneumonia, malignancy, and nervous system disease. Age was found to have the strongest association with ICU admission among COVID‐19 patients. All six models achieved an AUC greater than 0.60. Naïve Bayes achieved the best predictive performance (AUC = 0.71).
Conclusion: Naïve Bayes and LightGBM demonstrated promising results in predicting ICU admission needs in COVID‐19 patients. Machine learning models could help quickly identify high‐risk patients upon entry and reduce mortality and morbidity among COVID‐19 patients.


| INTRODUCTION
In December 2019, a novel coronavirus known as SARS-CoV-2 emerged in Wuhan, China, quickly spreading across the globe and leading to the COVID-19 pandemic. According to the World Health Organization (WHO), through May 2023, 764,991,756 confirmed cases of COVID-19 with 6,931,081 deaths were reported worldwide.1 In Iran, during the same period, there were 7,609,922 confirmed cases of COVID-19 with 146,165 deaths. The WHO has declared that COVID-19 is no longer a "global health emergency" while emphasizing that it remains a global health threat.2 COVID-19 is primarily spread through respiratory droplets and close contact, making it highly contagious and difficult to control. The symptoms of COVID-19 may differ considerably; some people suffer only mild to severe symptoms, while others need to be hospitalized.3 Studies have shown that the mortality rate for COVID-19 infections is around 0.66%, ranging from 0.04% in those under ten years of age to 16.6% in those over 70, with as many as one in five COVID-19 patients over 80 requiring hospitalization.4 During the COVID-19 pandemic, demand for the Intensive Care Unit (ICU) has risen significantly due to the disease's highly contagious nature.5 Studies estimate that 5% to 32% of COVID-19 patients need ICU care.6 A variety of factors, including age, sex, and comorbidities, have been linked in studies to the severity of the disease and ICU hospitalization.6,7 Patients with severe COVID-19 may experience acute kidney injury, acute respiratory distress syndrome (ARDS), myocarditis, and cardiac shock. These patients are usually admitted to the ICU, which can reduce death rates.[6][7][8] In previous studies, artificial intelligence, such as machine learning, has shown better predictive capabilities than traditional statistical methods.9 Machine learning (ML) is a powerful tool for analyzing large datasets that conventional statistical methods struggle to handle.
ML has become essential in various fields, with its ability to process unwieldy data, accommodate nonlinear interactions, and offer flexibility with assumptions.10 Artificial intelligence has revolutionized diagnosis and outcome prediction in COVID-19 patients.[11][12][13][14] In addition, ML can predict prognosis and mortality among COVID-19 patients based on ferritin, D-dimer, and procalcitonin.[15][16][17] Previous research has concentrated on forecasting the risk of ICU admission by leveraging different clinical and demographic factors. However, some clinical variables can only be obtained via time-consuming tests such as C-reactive protein, D-dimer, arterial blood gas, serum biochemical tests, and complete blood count. Moreover, earlier studies have utilized small sample sizes in making these predictions.18 In the present study, using a large data set of COVID-19 patients, different ML algorithms, including Decision Tree, Random Forest, Support Vector Machine, Naïve Bayes, K-Nearest Neighbor, and LightGBM, were tested. We sought to evaluate the predictive abilities of the six ML models in identifying indications for ICU admission.
In this study, data preprocessing was applied before the training of the ML models. The primary data set included 50,361 COVID-19 patients identified with ICD-10 codes U07.1 and U07.2, indicating the presence of COVID-19 based on laboratory testing and on clinical data without laboratory testing, respectively. Individuals under 18 and those who passed away within 24 h of admission or had missing outcome data were excluded from the study. The final data set included 44,112 patients, all diagnosed with COVID-19, of whom 7722 were admitted to the intensive care unit (ICU). Such an imbalanced input would bias conclusions toward the dominant class.
The synthetic minority over-sampling technique (SMOTE) was used to address the imbalanced data set. SMOTE is an artificial oversampling technique that generates synthetic samples of the minority class from randomly chosen minority instances and their k nearest neighbors.22 The technique selects a random data instance along with its k nearest neighbors. A second data instance is then chosen from the list of k closest neighbors.
A new synthetic sample is created as a convex combination along the line that connects the two samples. The minority and majority classes are then balanced by repeating this process.23 Compared with random oversampling, the SMOTE approach reduces the danger of overfitting.
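The interpolation step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `smote`, the toy two-feature vectors, and the choice of squared Euclidean distance are hypothetical and are not taken from the study's code.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority-class vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority instance.
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to it.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        # Choose the second instance among the k nearest neighbours.
        neighbour = rng.choice(others[:k])
        # Convex combination on the segment joining the two samples.
        lam = rng.random()
        synthetic.append([a + lam * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority class with two (hypothetical) normalized features.
minority = [[0.1, 1.0], [0.2, 0.9], [0.3, 1.1], [0.25, 1.05]]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two real minority samples, it stays inside the region the minority class already occupies, which is the intuition behind preferring SMOTE to duplicating rows.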

| Data analysis
The continuous variable in this model is age, which was normalized.
Before starting model training, the variables in each data set were analyzed. Correlations among variables were examined in a correlation matrix. Variables with high correlation coefficients (r > 0.9) would have been removed to prevent overfitting, yet no such correlation was observed among the variables. The Random Forest method, a machine learning approach adept at dealing with categorical and continuous variables, was used for feature selection based on two criteria: Mean Decrease Gini (MDG) and Mean Decrease Accuracy (MDA). MDG represents the average decrease in node impurity attributable to a variable, weighted by the proportion of samples reaching that node in each decision tree within the Random Forest.[25][26] MDG is favored as a criterion for important variable selection due to its capability to manage missing data and detect variable interactions. Finally, we used the variables selected based on MDG for model building.
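To make the MDG definition concrete, the sketch below computes the weighted Gini decrease for a single split of a single tree. Averaging this quantity over all splits on a variable and over all trees would give that variable's MDG. The function names and the toy labels are assumptions for illustration; the study itself relied on a Random Forest implementation, not on this code.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def weighted_gini_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of samples at the node."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return (n / n_total) * (gini(parent) - children)

# Toy node: 8 of 16 training samples reach it; 1 = ICU admission, 0 = no ICU.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]
decrease = weighted_gini_decrease(parent, left, right, n_total=16)
print(decrease)  # 0.140625
```

Here the parent impurity is 0.46875, the children average 0.1875, and half of the samples reach the node, so the weighted decrease is 0.5 × 0.28125.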

| Model development
The data set was randomly partitioned into training and test subsets. The training set comprised 70% (30,877 patients) and the test set 30% (13,235 patients) of the total study population. To address potential imbalances in the data, we used the SMOTE approach. Furthermore, 10-fold cross-validation was used for training the models and tuning the hyperparameters. This process entailed training the model on nine of the 10 folds of the training data, with validation on the remaining fold. Subsequently, the performance of the final models was assessed on the test data following hyperparameter adjustments.
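The partitioning scheme can be sketched with index lists alone. The helper names below are hypothetical, and a toy population of 1000 rows stands in for the 44,112 patients. The study used Scikit-learn for this step, so this is a plain-Python illustration of the idea, not the authors' code.

```python
import random

def train_test_split_indices(n, test_fraction=0.3, seed=0):
    """Shuffle row indices and hold out a fraction as the test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]

def k_fold(indices, k=10):
    """Yield (train, validation) index lists; each fold is held out exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

train_idx, test_idx = train_test_split_indices(1000)
splits = list(k_fold(train_idx, k=10))
print(len(train_idx), len(test_idx), len(splits))  # 700 300 10
```

A common precaution, not stated in the source, is to apply oversampling only to the nine training folds in each round, so that synthetic samples never leak into the fold used for validation.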
We utilized six machine learning models: logistic regression (LR), Support Vector Machine (SVM), Naïve Bayes (NB), K-nearest neighbor (KNN), Decision Tree (DT), and light gradient-boosting machine (LightGBM). Logistic regression is a linear model that is commonly used for binary classification. SVM is a supervised learning model that can perform classification and regression tasks. It finds the hyperplane that best separates the data points into different classes or predicts their values. It can also handle nonlinear problems using a kernel function that transforms the data into a higher-dimensional space.27 The Naïve Bayes classifier is a supervised machine learning algorithm used for tasks like text classification. Its main objective is to model the input distribution of a specific category or class.28 The KNN classifier is a non-parametric method that assigns an unidentified object to the class of the majority of its k closest neighbors.29 The decision tree is a supervised learning method primarily employed for classification tasks, although it can also handle regression.30,31 Finally, LightGBM is a gradient-boosting framework that uses tree-based learning algorithms. It is designed to be fast and efficient, using histogram-based algorithms to reduce the number of split points and leaf-wise algorithms to grow the trees. It can also handle large-scale data and support various objective functions.32 We trained these models using the training data set and tested their performance on the validation data set.
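As an illustration of how a Naïve Bayes classifier models the input distribution of each class, the following is a minimal Bernoulli variant for binary predictors with Laplace smoothing. The source does not state which Naïve Bayes variant was used, so the variant, the function names, and the toy data are all assumptions.

```python
import math

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and P(feature = 1 | class) with Laplace smoothing."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        probs = [(sum(col) + alpha) / (len(rows) + 2 * alpha) for col in zip(*rows)]
        model[c] = (prior, probs)
    return model

def predict_nb(model, x):
    """Return the class with the highest log-posterior for one binary vector."""
    def log_posterior(c):
        prior, probs = model[c]
        return math.log(prior) + sum(
            math.log(p if v else 1 - p) for p, v in zip(probs, x)
        )
    return max(model, key=log_posterior)

# Toy data: two hypothetical comorbidity flags per patient; label 1 = ICU admission.
X = [[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0]]
y = [1, 1, 1, 0, 0, 0]
model = fit_bernoulli_nb(X, y)
print(predict_nb(model, [1, 1]), predict_nb(model, [0, 0]))  # 1 0
```

The "naïve" part is the sum over features inside `log_posterior`: each predictor contributes independently given the class, which keeps the model cheap to fit on large data sets.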
TABLE 1 Confusion matrix for binary outcome.
The receiver operating characteristic (ROC) curves, displayed as sensitivity against 1 - specificity, were used, and the AUC was calculated.33 Moreover, two-sided p values were used to compare the AUCs of the machine learning models with that of logistic regression using MedCalc software (version 22.021).34 The best prediction model was ultimately selected based on performance.
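The AUC can be read as the probability that a randomly chosen ICU patient receives a higher predicted score than a randomly chosen non-ICU patient, with ties counted as one half. The sketch below computes it that way on toy scores. The function name and the data are hypothetical, and this pairwise form is an equivalent formulation rather than the procedure the authors ran.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 1 = ICU admission, scores are predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2]
auc = roc_auc(y_true, scores)
print(round(auc, 3))  # 0.889
```

Of the nine positive-negative pairs, eight are ordered correctly, giving 8/9. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.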
Statistical analysis was done using the R statistical language (version 4.1.2; R Core Team, 2021) and the Python programming language (version 3.10; Python Software Foundation, 2021). Splitting the data into training and test sets and constructing and assessing the models were carried out with the Scikit-learn library.35 Data preprocessing and graphical representation were performed using Pandas 36 and NumPy 37 libraries.
admitted to the ICU and those not admitted reveals substantial differences in prevalence. Among the patients admitted to the ICU, influenza & pneumonia emerges as the most prevalent comorbidity, affecting 25.4% of individuals, which is more than double the prevalence observed in patients not admitted to the ICU (11.3%).

Data
Similarly, diabetes and hypertension also exhibit higher prevalence rates among ICU-admitted patients, with rates of 14.3% and 12.7%, respectively, compared to 5.7% and 5.0% among non-ICU patients. The descriptive statistics for the variables are shown in Table 2.
FIGURE 1 The missing data during the preprocessing phase and the final analyzed data set.
Figure 2 shows the coefficient of correlation between variables.
A moderate to weak correlation existed between fever and cough (0.45) and cough and dyspnea (0.54).The Random Forest approach was used to determine the importance of the predictor variables, as indicated in Figure 3. Age, cardiac disease, cough, hypertension, diabetes, influenza & pneumonia, malignancy, and nervous system disease were selected according to MDG.The optimal hyperparameters for each model were obtained using CV, as shown in Table 3.
The performance of the machine learning models in the validation data set is shown in Table 4. NB's AUC, specificity, and sensitivity were 0.71, 0.67, and 0.63, respectively. The AUC of NB was significantly different from the AUC of LR (p = 0.020). The AUC of LightGBM (0.70) was quite similar to that of NB.
The sensitivity and specificity of NB were 0.61 and 0.68, respectively.
Figure 4 shows the ROC curves used to compare the prediction models.
studies mainly focused on mortality,[38][39][40] while we considered admission to the ICU. The most important predictors in our study were age, cardiac disease, cough, hypertension, diabetes, influenza & pneumonia, malignancy, and nervous system disorders. Our analysis found that the NB model had the highest AUC (0.71), followed closely by LightGBM (AUC = 0.70). Adding many variables to a machine learning model can lower its accuracy. Therefore, we used Random Forest to select important variables considering their correlations.[45][46] Moreover, we used AUC to evaluate our models' efficiency, as in similar studies.47,48 Several studies have applied machine learning methods to predict the risk of ICU admission of COVID-19 patients. For instance, a study by Magunia et al.49 developed a model to predict

| DISCUSSION
FIGURE 3 The importance of predictor variables in the Random Forest model.
Moreover, blood gas parameters and biomarkers have been used to automatically diagnose and predict prognosis among COVID-19 patients.10,11 In this regard, Huyut et al., using the Chi-squared Automatic Interaction Detector decision tree model, found that low serum ionized calcium (<1.10 mM) significantly predicted ICU admission in COVID-19 patients.52 Despite using a larger data set than other studies, our study had several limitations. Lack of high-quality data is one issue, particularly in low- and middle-income countries with limited access to healthcare. In addition, due to the absence of consistent data collection and reporting, it may be difficult to compare and generalize findings across different populations. For instance, there may have been selection bias due to the exclusion of individuals who passed away within 24 h of arrival because these patients may have had more severe disease. Second, when the data were gathered, just 16% of the population in Iran had received their first vaccination. As a result, we could not include vaccination status in our model. Third, as the virus mutates and healthcare policies and recommendations change, the course of the disease may impact how well these models work. Finally, ICD-10 codes might lead to disease misclassification or inaccuracies. This results in less precise input data for our model and, subsequently, moderate performance.

| CONCLUSION
Our study used machine learning models to provide helpful insights into patient management and treatment, particularly in high-risk populations, by identifying the most critical characteristics of ICU admission in COVID-19 patients. Naïve Bayes and LightGBM demonstrated promising results among several algorithms. However, the features derived from ICD-10 codes might not have been sufficiently informative or representative of the critical factors that influence ICU admission, which could be responsible for the moderate performance of our models. Therefore, further research is needed to determine the most effective models and to ensure their practical implementation in clinical settings.
from 50,361 COVID-19 patients was collected. The details about excluded patients during preprocessing are shown in Figure 1. The final data set included 44,112 patients, of whom 45.5% were women. A total of 7722 (17.5%) patients were admitted to the ICU, with a mean age of 61.08 (SD = 16.66, range: 18-120) years. The comparison of common comorbidities between patients
This study utilized a retrospective analysis of clinical data from medical records of COVID-19 patients. We investigated the risk factors for ICU admission in COVID-19 patients and developed a predictive model using machine learning algorithms. Some previous
FIGURE 2 Correlation matrix between all variables. p < 0.05 is considered significant.
KARIMI ET AL. | 5 of 10
We gauged the models' performance through five metrics: accuracy, sensitivity, specificity, area under the curve (AUC), and F1 score. These metrics are computed based on Table 1 and formulas 1-4.
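The threshold-based metrics follow directly from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions, which may differ in presentation from the paper's own formulas 1-4. The function name and the counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard metrics from confusion-matrix counts (positive class = ICU admission)."""
    sensitivity = tp / (tp + fn)            # true positive rate (recall)
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)              # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }

# Toy counts, not taken from the study.
m = classification_metrics(tp=60, fp=30, fn=40, tn=70)
print({name: round(value, 3) for name, value in m.items()})
# {'accuracy': 0.65, 'sensitivity': 0.6, 'specificity': 0.7, 'f1': 0.632}
```

With imbalanced outcomes such as ICU admission, accuracy alone can look acceptable while sensitivity is poor, which is one reason to report sensitivity, specificity, and F1 alongside it.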
Abbreviations: ICU, intensive care unit; NPV, negative predictive value.
TABLE 2 The patients' characteristics by ICU admission status.
Note: Age is presented as mean (SD). All other data are shown as number (%). Abbreviations: ICU, intensive care unit; SD, standard deviation.
TABLE 4 Results of all prediction models on the need for ICU admission in COVID-19 patients. Data are presented as estimate (confidence interval). Abbreviations: ACC, accuracy; AUC, area under the curve; CI, confidence interval; DT, Decision Tree; KNN, K-Nearest neighbor; LightGBM, light gradient-boosting machine; LR, logistic regression; NB, Naïve Bayes; SE, sensitivity; SP, specificity; SVM, support vector machine.
*p value < 0.05 is considered significant.
FIGURE 4 The receiver operating characteristic (ROC) curves for all models. AUC, area under the curve; DT, Decision Tree; KNN, K-Nearest neighbor; LightGBM, light gradient-boosting machine; LR, logistic regression; NB, Naïve Bayes; SVM, support vector machine.